<p float="left">
  <img src="PlantHub-full-rgb.png" style="height:100px" alt="PlantHub logo" class="light-logo">
  <img src="PlantHub-full-white.png" style="height:100px" alt="PlantHub logo" class="dark-logo">
  <img src="gfoe.png" style="height:100px" alt="gfö logo" class="light-logo">
  <img src="gfoe_inv.png" style="height:100px" alt="gfö logo" class="dark-logo">
  <img src="NFDI4Biodiversity.svg" style="height:100px" alt="NFDI logo" class="light-logo">
   <img src="NFDI4Biodiversity_text_inv.svg" style="height:100px" alt="NFDI 2023 logo" class="dark-logo">
</p>

# Taxonomic name resolution

This notebook is intended to demonstrate the workflow of taxonomic name resolution. Taxonomic name resolution is the process of checking and de-synonymizing taxonomic names using a reference database. This is necessary because many taxonomic entities, such as species, are known and have been described using several scientific names. However, only one of those scientific names is the <b>accepted name</b>, while the others are referred to as <b>synonyms</b> (although there are special cases, e.g., unresolved names). To correctly link data from several sources, it is necessary to check and de-synonymize names using a common reference database.

This process is complicated by spelling errors, names not found in the database, and other challenges. In this notebook, we will utilize the powerful [GBIF API](https://techdocs.gbif.org/en/openapi/) to resolve the names from a sample list. However, users may not always want to apply the GBIF taxonomy, and not all datasets incorporated in GBIF are the newest versions. Therefore, in the hands-on part of this notebook, we will attempt to create a custom name resolution function using the [Leipzig Catalogue of Vascular Plants (LCVP)](https://planthub.idiv.de/LCVP/). We will also strive to speed up the name resolution process by using the parallel processing functionality of R.

## Prerequisites

To run the code presented here, you will need 
- the sample names list provided in the workshop,
- a version of the Leipzig Catalogue of Vascular Plants (LCVP), also provided in the workshop,
- a functioning R environment and 
- the R packages `data.table`, `rgbif`, and `doSNOW` installed.

## Code

The first block of code loads libraries and prepares the workspace. You will need to adapt the working directory.

In [1]:
# load packages
library(data.table) # handle large datasets
library(rgbif) # access GBIF data
library(doSNOW) # parallel computing

# clear workspace
rm(list = ls())

# set working directory
setwd(paste0(.brd, "gfoe NFDI taxonomic harmonization workshop"))

# load data
plants <- fread("plant names_2024-04-08.txt", sep = "\t")
animals <- fread("animal names_2024-04-09.txt", sep = "\t")


Loading required package: foreach

Loading required package: iterators

Loading required package: snow



For the sake of simplicity, we will proceed to do the name matching using the names as they are. Note that name parsing can help to increase the accuracy and efficiency of the name harmonization process.

### Encoding

Unfortunately, when getting data from differing sources, we will often find that these data have been [encoded](https://en.wikipedia.org/wiki/Binary-to-text_encoding) in different ways. This means that while the typical English language letters will be stored the same way on any machine, when it comes to accents and some other special characters, it may matter whether data was stored by a computer in the US or Japan, and whether the computer has a Windows, Mac, or Linux operating system. 

We will deal with the most common case: data stored in the Windows-specific [CP-1252 encoding](https://en.wikipedia.org/wiki/Windows-1252) (sometimes mislabeled as ANSI or latin1) instead of [UTF-8](https://en.wikipedia.org/wiki/UTF-8).

How your machine treats data from different encodings depends on what encoding is preset in your console. You can check this using the following:

In [2]:
Sys.getlocale()


If your console has no UTF-8 setting (no matter the language) you may change it like this: 

`Sys.setlocale(category = "LC_ALL", locale = "German_Germany.utf8")`

You can use another encoding, too, but it may throw errors later on. So let's check whether the data comes in UTF-8, and if not, let's repair it, assuming it is CP-1252 (our best guess, likely correct in 99% of the cases).

In [3]:
# check whether correct encoding is UTF-8
table(validUTF8(plants$oldName))
table(validUTF8(animals$modName))



FALSE  TRUE 
   73  4927 


TRUE 
5000 

In [4]:
# create new columns for variables
plants[, newName := oldName]
animals[, newName := modName]
# correct encoding, assuming current encoding is CP-1252
plants[!validUTF8(newName), newName := iconv(newName, from = "CP1252", to = "UTF-8")]
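
The check-and-repair step can be tried out on a tiny vector first. The following is a minimal sketch with made-up input: the second string contains the single byte `0xE9`, which is "é" in CP-1252 but not a valid UTF-8 sequence. The variable names `x`, `bad`, and `fixed` are only used for illustration.

```r
# two toy names; the second one contains the CP-1252 byte 0xE9 ("é")
x <- c("Abies alba", "Quercus agrifolia N\xe9e")

# validUTF8() flags the second element as invalid
validUTF8(x)

# convert only the invalid elements, assuming they are CP-1252
fixed <- x
bad <- !validUTF8(fixed)
fixed[bad] <- iconv(fixed[bad], from = "CP1252", to = "UTF-8")

# all elements should be valid UTF-8 now
all(validUTF8(fixed))
```

Converting only the invalid elements matters: running `iconv()` on strings that are already valid UTF-8 would garble their special characters.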


### Name resolution

To access the GBIF backbone, we can use the `name_backbone_checklist` function found in the rgbif package.

In [5]:
resP <- data.table(name_backbone_checklist(plants$newName))
resA <- data.table(name_backbone_checklist(animals$newName))


It took some time, but we got some results. Let's look at the result structure.

In [6]:
str(resP)


Classes 'data.table' and 'data.frame':	5000 obs. of  25 variables:
 $ confidence      : int  100 100 100 98 94 98 94 99 99 99 ...
 $ matchType       : chr  "NONE" "NONE" "NONE" "EXACT" ...
 $ synonym         : logi  FALSE FALSE FALSE TRUE FALSE TRUE ...
 $ usageKey        : int  NA NA NA 2977925 3152705 2685510 2684876 8132859 3138531 3830289 ...
 $ acceptedUsageKey: int  NA NA NA 11698476 NA 2685508 NA NA NA NA ...
 $ scientificName  : chr  NA NA NA "Abarema curvicarpa (H.S.Irwin) Barneby & J.W.Grimes" ...
 $ canonicalName   : chr  NA NA NA "Abarema curvicarpa" ...
 $ rank            : chr  NA NA NA "SPECIES" ...
 $ status          : chr  NA NA NA "SYNONYM" ...
 $ kingdom         : chr  NA NA NA "Plantae" ...
 $ phylum          : chr  NA NA NA "Tracheophyta" ...
 $ order           : chr  NA NA NA "Fabales" ...
 $ family          : chr  NA NA NA "Fabaceae" ...
 $ genus           : chr  NA NA NA "Jupunba" ...
 $ species         : chr  NA NA NA "Jupunba curvicarpa" ...
 $ kingdomKey     

As there is a column named "kingdom", we might check whether all matches are actually plants, and how many non-matches we got. Another important piece of information can be found in the "matchType" column. Here, we can see whether names were retrieved exactly as they were spelled, whether some fuzzy matching was done, or whether they could only be matched to a higher rank. The latter means that names may only have been matched to genus, family, order, or phylum level. It is worth checking these results.

In [7]:
table(resP$kingdom)
sum(is.na(resP$kingdom))
table(resP$matchType)



Animalia    Fungi  Plantae 
       1       40     4814 


     EXACT      FUZZY HIGHERRANK       NONE 
      4550        138        167        145 

In [8]:
resP[kingdom != "Plantae"]


confidence,matchType,synonym,usageKey,acceptedUsageKey,scientificName,canonicalName,rank,status,kingdom,⋯,kingdomKey,phylumKey,classKey,orderKey,familyKey,genusKey,speciesKey,class,verbatim_name,verbatim_index
<int>,<chr>,<lgl>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<dbl>
99,EXACT,False,3414350,,Acarospora radicata H.Magn.,Acarospora radicata,SPECIES,ACCEPTED,Fungi,⋯,5,95,180,1271.0,8347.0,2600495,3414350,Lecanoromycetes,Acarospora radicata,63
99,EXACT,False,2592767,,"Arthonia polymorpha Ach., 1814",Arthonia polymorpha,SPECIES,ACCEPTED,Fungi,⋯,5,95,313,1273.0,8363.0,2581942,2592767,Arthoniomycetes,Arthonia polymorpha,420
99,EXACT,False,7250670,,Aspicilia candida (Anzi) Hue,Aspicilia candida,SPECIES,ACCEPTED,Fungi,⋯,5,95,180,1051.0,4116.0,2599747,7250670,Lecanoromycetes,Aspicilia candida,445
99,EXACT,False,3419869,,Aulaxina microphana (Vain.) R.Sant.,Aulaxina microphana,SPECIES,ACCEPTED,Fungi,⋯,5,95,180,1279.0,2161.0,6610216,3419869,Lecanoromycetes,Aulaxina microphana,507
98,EXACT,True,2608288,3438532.0,Bacidia kingmanii Hasse,Bacidia kingmanii,SPECIES,SYNONYM,Fungi,⋯,5,95,180,10848190.0,8296.0,2569865,3438532,Lecanoromycetes,Bacidia kingmanii,527
99,EXACT,False,2609464,,Buellia griseovirens (Turner & Borrer ex Sm.) Almb.,Buellia griseovirens,SPECIES,ACCEPTED,Fungi,⋯,5,95,180,10861608.0,4115.0,2587707,2609464,Lecanoromycetes,Buellia griseovirens,711
99,EXACT,False,2609287,,Calicium quercinum Pers.,Calicium quercinum,SPECIES,ACCEPTED,Fungi,⋯,5,95,180,10861608.0,4115.0,2592559,2609287,Lecanoromycetes,Calicium quercinum,789
98,EXACT,True,2610162,7462577.0,Caloplaca inconspecta Arup,Caloplaca inconspecta,SPECIES,SYNONYM,Fungi,⋯,5,95,180,1050.0,8368.0,7251287,7462577,Lecanoromycetes,Caloplaca inconspecta,805
98,EXACT,True,3469366,2596842.0,Catapyrenium caeruleopulvinum J.W.Thomson,Catapyrenium caeruleopulvinum,SPECIES,SYNONYM,Fungi,⋯,5,95,178,1043.0,4841.0,2596841,2596842,Eurotiomycetes,Catapyrenium caeruleopulvinum,929
99,EXACT,False,7186902,,Cetraria islandica subsp. islandica,Cetraria islandica islandica,SUBSPECIES,ACCEPTED,Fungi,⋯,5,95,180,1048.0,8305.0,2601309,2605272,Lecanoromycetes,Cetraria islandica ssp. islandica,989


The only "animal" in this dataset is <i>Triphora minima</i>, which also happens to be an orchid species. To get the correct match, we could re-run the query for this particular species adding the parameter `kingdom = "Plantae"`. You should not do this from the start unless you are sure that all your data belong to a certain group. The species labelled as fungi may rather be lichens, as in the case of <i>Usnea</i>, but their classification is plausible, as it is known that the TRY database includes lichens.

Let's check the "HIGHERRANK" matches. ("scientificName" includes authors, "canonicalName" is without authors, and "verbatim_name" is the name queried.)

In [9]:
resP[matchType == "HIGHERRANK", c("scientificName", "canonicalName", "verbatim_name")]


scientificName,canonicalName,verbatim_name
<chr>,<chr>,<chr>
Abies Mill.,Abies,Abies sp
Acacia Mill.,Acacia,Acacia contriva
Acacia Mill.,Acacia,Acacia sp1887
Aconitum L.,Aconitum,Aconitum napellus L. Orthodox
Adenanthera L.,Adenanthera,Adenanthera sp.
Aesculus L.,Aesculus,Aesculus xworlitzensis
Aloinopsis Schwantes,Aloinopsis,Aloinopsis gydouwensis
Alyssum L.,Alyssum,Alyssum thunbergii Moq. Orthodox p
Ampelocera Klotzsch,Ampelocera,Ampelocera indet
Asphodeline Rchb.,Asphodeline,Asphodeline fistulosus


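A large part of these matches are cases where the species epithet is given as "sp"/"sp.", or where authors may be interpreted as part of the epithet. Preprocessing the names (see name parsing) may help mitigate these issues. For example, querying <i>Aconitum napellus</i> without the trailing text returns an exact match at species level: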



In [10]:
name_backbone("Aconitum napellus")


Unnamed: 0_level_0,usageKey,scientificName,canonicalName,rank,status,confidence,matchType,kingdom,phylum,order,⋯,kingdomKey,phylumKey,classKey,orderKey,familyKey,genusKey,speciesKey,synonym,class,verbatim_name
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<lgl>,<chr>,<chr>
1,3033665,Aconitum napellus L.,Aconitum napellus,SPECIES,ACCEPTED,97,EXACT,Plantae,Tracheophyta,Ranunculales,⋯,6,7707728,220,399,2410,3033663,3033665,False,Magnoliopsida,Aconitum napellus


Let's check the animals now.

In [11]:
table(resA$kingdom)
sum(is.na(resA$kingdom))
table(resA$matchType)



Animalia 
    4821 


     EXACT      FUZZY HIGHERRANK       NONE 
      4408        282        131        179 

Here, classification into animals was unambiguous. Let's check the higher rank matches.

In [12]:
resA[matchType == "HIGHERRANK", c("scientificName", "canonicalName", "verbatim_name")]


scientificName,canonicalName,verbatim_name
<chr>,<chr>,<chr>
"Accipiter Brisson, 1760",Accipiter,"Accipiter spp.-5 Rothschild & Hartert, 1926"
Achillides chikae,Achillides chikae,Achillides chikae chikae
Achillides chikae,Achillides chikae,"Achillides chikae hermeli Nuyda, 1992"
Animalia,Animalia,"Acropora morphospec1 Veron & Wallace, 1984"
Acroporidae,Acroporidae,"Acroporidae_Acropora cuneata (Dana, 1846)"
Animalia,Animalia,"Acropora josephi George & Sukumaran, 2007"
Animalia,Animalia,"Acropora mannaqensis George & Sukumaran, 2007"
Acroporidae,Acroporidae,"Acroporidae Acropora subglabra (Brook, 1891)"
Acroporidae,Acroporidae,"Acroporidae_Acropora tanegashimensis Veron, 1990"
Animalia,Animalia,"Acropora thomasi George & Sukumaran, 2007"


There are some problems with the classification of specific genera, like <i>Acropora</i>, and there are problems with genera that are erroneously repeated within names, like <i>Smaug smaug breyeri</i>. Additionally, family names before the actual scientific names should also be removed. These are issues that can also be alleviated during pre-processing.
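
A light-weight pre-processing step for the issues seen above could look like the following sketch. `cleanName` is a hypothetical helper, not part of any package. The rules are assumptions made for illustration: family names are assumed to end in "aceae" or "idae", and "sp", "sp.", "spp." and "indet" are assumed to be the only placeholders.

```r
# hypothetical helper: light-weight cleaning before name matching
cleanName <- function(x) {
	# replace underscores by regular spaces
	x <- gsub("_", " ", x, fixed = TRUE)
	# drop placeholders such as "sp", "sp.", "spp." or "indet" and anything after them
	x <- sub("\\s+(spp?\\.?|indet\\.?)(\\s.*)?$", "", x, perl = TRUE)
	# drop a leading family name (assumed to end in "aceae" or "idae")
	x <- sub("^[A-Z][a-z]+(aceae|idae)\\s+", "", x, perl = TRUE)
	# collapse a repeated genus, e.g. "Smaug smaug breyeri";
	# caution: this also shortens legitimate tautonyms such as "Bison bison bison"
	x <- sub("^([A-Z][a-z]+)\\s+\\1\\s+", "\\1 ", x, perl = TRUE, ignore.case = TRUE)
	# squeeze multiple spaces and trim
	trimws(gsub("\\s+", " ", x))
}

cleaned <- cleanName(c(
	"Abies sp", "Adenanthera sp.", "Ampelocera indet",
	"Acroporidae_Acropora cuneata (Dana, 1846)", "Smaug smaug breyeri"
))
print(cleaned)
```

Rules like these are dataset-specific, so it is worth inspecting which names are changed before using the cleaned names for matching.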

### Writing your own name resolution function

Sometimes, you may want to check datasets against a very specific reference database, or your name resolution service of choice may not use the newest version of your reference dataset. In this case, you can write your own name resolution algorithm. Be aware: There are a lot of caveats in this process, and nowadays, it will not be easy to write a function that matches the correctness and efficiency of services like GBIF. Especially when it comes to speed and large datasets, it is unlikely a function that cannot be used in parallel on your own machine or a high performance cluster will deliver results for big datasets within an acceptable time frame.

Let's imagine we want to check our plants dataset against the newest version of the [Leipzig Catalogue of Vascular Plants (LCVP)](https://planthub.idiv.de/LCVP/). 

In [13]:
# read in dataset
LCVP <- fread(paste0(.brd, "PlantHub/LCVP_PlantHub_2024-01-25.gz"))
str(LCVP)


Classes 'data.table' and 'data.frame':	1337778 obs. of  21 variables:
 $ global Id                 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Input Genus               : chr  "Aa" "Aa" "Aa" "Aa" ...
 $ Input Epitheton           : chr  "argyrolepis" "aurantiaca" "brevis" "calceata" ...
 $ Rank                      : chr  "species" "species" "species" "species" ...
 $ Input Subspecies Epitheton: chr  "" "" "" "" ...
 $ Input Authors             : chr  "(Rchb.f.) Rchb.f." "D.Trujillo" "Schltr." "(Rchb.f.) Schltr." ...
 $ Status                    : chr  "accepted" "accepted" "synonym" "accepted" ...
 $ globalId of Output Taxon  : int  1 2 819078 4 819080 6 7 8 9 10 ...
 $ Output Taxon              : chr  "Aa argyrolepis (Rchb.f.) Rchb.f." "Aa aurantiaca D.Trujillo" "Myrosmodes breve (Schltr.) Garay" "Aa calceata (Rchb.f.) Schltr." ...
 $ family                    : chr  "Orchidaceae" "Orchidaceae" "Orchidaceae" "Orchidaceae" ...
 $ Order                     : chr  "Asparagales" "Asparagales" "Asp

This is an enhanced version of LCVP 2.0, with some errors corrected, and some data added. It includes ASCII-only columns nameIn, authorsIn, nameOut, and authorsOut, as well as links to IPNI, POWO, WFO, and WorldPlants.

Let's set up a simple matching algorithm. You may work on it to include some of the main problematic cases.

First, let's tune our reference list. We would like to be able to identify families and genera before the actual matching, and to do this efficiently, we can extract those from LCVP. We should also be able to directly match complete names with authors, so let's create a column with those, too.

In [14]:
# genera
genera <- sort(unique(sub("\\s.*", "", LCVP$nameIn)))
genera <- genera[genera != ""]
genera[1:10]
# families
families <- sort(unique(LCVP$family))
families <- families[families != ""]
families[1:10]
# name + author column
LCVP[, fullNameIn := trimws(paste(nameIn, authorsIn))]
LCVP$fullNameIn[1:10]


Let's prepare a results table. For simplicity, we will store the ID of matches in LCVP when a name was found, and indicate whether it is a genus or family found in LCVP (without ID) otherwise.

In [15]:
resTable <- data.table(name = plants$newName, LCVP_ID = numeric(), LCVP_genus = logical(), LCVP_family = logical())


"Item 2 has 0 rows but longest item has 5000; filled with NA"
"Item 3 has 0 rows but longest item has 5000; filled with NA"
"Item 4 has 0 rows but longest item has 5000; filled with NA"


We can ignore the warnings which just tell us that the LCVP columns in the results table are empty for now. Let's fill the table.

In [16]:
# test whether names found in genera
which(plants$newName %in% genera)
# test whether names found in families
which(plants$newName %in% families)

# write data into results
resTable[plants$newName %in% genera, LCVP_genus := TRUE]

resTable[plants$newName %in% families, LCVP_family := TRUE]

# test whether names in nameIn, i.e. names without authors
which(plants$newName %in% LCVP$nameIn)
# test whether names in fullNameIn, i.e. names with authors
which(plants$newName %in% LCVP$fullNameIn)


As we can see, there are many matches both when searching with and without authors. However, for names without authors, there may be more than one name in the reference list (they are called homonyms). Only one of those will be an accepted name, while the others are synonyms. As the matched names without authors do not allow for a disambiguation, we will assign the ID of the accepted name from the reference list, if there are several. To do this, we create a copy of the reference list, order by taxonomic status so that accepted names come first, and only keep the first of several rows with identical names without authors. We then use this reduced list to extract the respective IDs.
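
On a toy table, the idea can be sketched as follows. The names and statuses in `ref` are invented for illustration. Note the use of `copy()`: in data.table, a plain `<-` only creates a second reference to the same table, so `setorder()` would reorder the original as well.

```r
library(data.table)

# toy reference list with one homonym pair ("Xus albus"); all values are invented
ref <- data.table(
	id = 1:4,
	nameIn = c("Xus albus", "Xus albus", "Xus niger", "Yus ruber"),
	status = c("synonym", "accepted", "accepted", "accepted")
)

# independent copy, so that `ref` keeps its original row order
refUnique <- copy(ref)
# "accepted" sorts before "synonym", so accepted names come first
setorder(refUnique, status)
# keep only the first row per name without authors
refUnique <- unique(refUnique, by = "nameIn")

# the accepted entry (id 2) is kept for the homonym
refUnique[nameIn == "Xus albus"]
```

Sorting alphabetically by status only works as long as "accepted" comes first among the status values present. With other status labels, an explicit ordering would be safer.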

In [17]:
# create a copy of the reference list
# (copy() is needed: a plain `<-` would only create a second reference to LCVP)
LCVPUnique <- copy(LCVP)
# order by taxonomic status
setorder(LCVPUnique, status)
# keep only the first of several rows with identical names without authors
LCVPUnique <- unique(LCVPUnique, by = "nameIn")

# check whether it worked
nrow(LCVPUnique)
nrow(LCVP)


We removed about 80000 names from LCVP in this process. Let's now get the IDs.

In [18]:
# write data into results, extract LCVP ID
setkey(LCVP, fullNameIn)
res <- LCVP[plants$newName]
resTable[is.na(LCVP_ID), LCVP_ID := res$`global Id`[is.na(resTable$LCVP_ID)]]

setkey(LCVPUnique, nameIn)
res <- LCVPUnique[plants$newName]
resTable[is.na(LCVP_ID), LCVP_ID := res$`global Id`[is.na(resTable$LCVP_ID)]]
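
The lookup used here relies on data.table's keyed join: indexing a keyed table with a character vector returns one row per queried name, in query order, with `NA` for names that are not found. A minimal sketch on invented data follows. The argument `mult = "first"` is an optional safeguard that keeps the result aligned with the query if the key column contains duplicates.

```r
library(data.table)

# toy reference list; all values are invented
ref <- data.table(id = 1:3, nameIn = c("Xus albus", "Xus niger", "Yus ruber"))
setkey(ref, nameIn)

# names to look up; the second one is not in the reference list
query <- c("Yus ruber", "Zus viridis", "Xus albus")

# one result row per query name, NA where nothing was found
res <- ref[query, mult = "first"]
res$id
```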


We can now check what remains from the names in our list. The remainder will be the difficult part where the algorithm used actually matters. 

In [19]:
plants[, matched := FALSE]
plants[!is.na(resTable$LCVP_ID) | !is.na(resTable$LCVP_genus) | !is.na(resTable$LCVP_family), matched := TRUE]
table(plants$matched)



FALSE  TRUE 
 1180  3820 

From the 5000 names we had to check, 1180 remain to be tested. This is a relatively large number, as this dataset is especially messy, but good for us to practice. We should have a look at the unmatched names.

In [20]:
plants[matched == FALSE]$newName[1:20]


It seems that, first of all, we should get rid of author names, as their spelling may differ from the one in LCVP and therefore prevent a match. A very simple way of doing so is to cut names after the second whitespace.

In [21]:
# function to extract first two words
nameShorter <- function(x) {
	# get positions of all whitespaces
	ws <- gregexpr(" ", x)
	# get position of second whitespace if available, otherwise return 0
	ws <- sapply(ws, function(x) if (length(x) > 1) x[2] else 0)
	x[ws > 0] <- substr(x[ws > 0], 1, ws[ws > 0] - 1)
	return(x)
}
print(nameShorter(plants$newName[40:50]))


 [1] "Acacia tortilis"                  "Acacia valida"                   
 [3] "Acacia yorkrakinensis"            "Acacia auriculiformis A.Cunn. ex"
 [5] "Acacia colei Maslin &"            "Acacia drummondii Lindl."        
 [7] "Acacia hadrophylla R.S.Cowan &"   "Acacia lazaridis Pedley"         
 [9] "Acacia neriifolia A.Cunn. ex"     "Acacia ptychoclada Maiden &"     
[11] "Acacia speckii R.S.Cowan &"      


We see that some of the short names are not as expected: there are still author names attached to them. The reason is that these names contain non-breaking (protected) whitespaces instead of regular ones. We need to replace them first.

In [22]:
# function to extract first two words
nameShorter <- function(x) {
	# replace non-breaking (protected) whitespaces with regular ones
	x <- gsub("\xc2\xa0", " ", x)
	# get positions of all whitespaces
	ws <- gregexpr(" ", x)
	# get position of second whitespace if available, otherwise return 0
	ws <- sapply(ws, function(x) if (length(x) > 1) x[2] else 0)
	x[ws > 0] <- substr(x[ws > 0], 1, ws[ws > 0] - 1)
	return(x)
}
print(nameShorter(plants$newName[40:50]))


 [1] "Acacia tortilis"       "Acacia valida"         "Acacia yorkrakinensis"
 [4] "Acacia auriculiformis" "Acacia colei"          "Acacia drummondii"    
 [7] "Acacia hadrophylla"    "Acacia lazaridis"      "Acacia neriifolia"    
[10] "Acacia ptychoclada"    "Acacia speckii"       


This looks much nicer. We can now do the name matching without authors again. Note that, ideally, we would not just match without authors, but also measure the difference between author names so that we actually select the closest match.

In [23]:
plants[, shortName := nameShorter(newName)]

setkey(LCVPUnique, nameIn)
res <- LCVPUnique[plants$shortName]
resTable[is.na(LCVP_ID), LCVP_ID := res$`global Id`[is.na(resTable$LCVP_ID)]]

# update the "matched" column
plants[!is.na(resTable$LCVP_ID) | !is.na(resTable$LCVP_genus) | !is.na(resTable$LCVP_family), matched := TRUE]
table(plants$matched)



FALSE  TRUE 
  581  4419 

As we can see, the number of unmatched names was reduced from 1180 to 581. We could now introduce some fuzzy matching, i.e. try to assign names from the reference list to names with spelling errors. Of course, we could also consider other pre-processing options: correcting the uppercase/lowercase of the names, removing special characters like question marks or underscores, removing "sp.", etc. 

In [24]:
plants[matched == FALSE]$shortName[1:50]


The fuzzy matching will be done in a loop using a matching function. Later on, this will allow us to easily switch to parallel processing. The below function first checks for the presence of the first word, assumed to be the genus, in the reference list. If it is found, the fuzzy matching will only be done on the species belonging to this genus, massively reducing the computation time. Then, fuzzy matching is done, the best result(s) selected and the first of the best results or no result returned (in case there is none).
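
The three steps (restrict to the genus, filter approximately, rank by distance) can be sketched on a handful of names. This is a simplified illustration on invented input, not the function used in this notebook. In particular, `agrepl()` is called without anchors here, so it also accepts approximate substring matches, and the ranking by `adist()` decides which candidate is returned.

```r
# toy reference names and a misspelled query
refNames <- c("Quercus robur", "Quercus rubra", "Quercus petraea", "Fagus sylvatica")
query <- "Quercus robor"

# step 1: restrict candidates to the genus of the query (first word)
genus <- sub("\\s.*", "", query)
cand <- refNames[sub("\\s.*", "", refNames) == genus]

# step 2: keep candidates that approximately contain the query
cand <- cand[agrepl(query, cand, max.distance = 2)]

# step 3: rank the remaining candidates by Levenshtein distance
dists <- as.vector(adist(query, cand))
# take the first of the best candidates, or NA if none is left
best <- if (length(cand) > 0) cand[which.min(dists)] else NA_character_
best
```

Restricting the candidates first keeps the number of string comparisons small, which is where most of the computation time goes.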

In [25]:
# create a template to return in case there is no match
resTemplate <- LCVPUnique[1]
resTemplate[1] <- NA

# function for fuzzy matching
# maxDist controls the Levenshtein distance, i.e. the difference between the given and matched name
nameMatcher <- function(x, maxDist = 2) {
	genus <- sub("\\s.*", "", x$shortName)
	if (genus %in% genera) {
		checkRows <- sub("\\s.*", "", LCVPUnique$nameIn) == genus
	} else {
		checkRows <- rep(TRUE, nrow(LCVPUnique))
	}
	# do fuzzy matching
	res <- LCVPUnique[checkRows][agrepl(paste0("^", x$shortName, "$"), LCVPUnique$nameIn[checkRows],
		max.distance = maxDist, fixed = FALSE
	)]
	if (nrow(res) > 0) {
		# calculate Levenshtein distance
		dists <- adist(x$shortName, res$nameIn)
		# keep best result(s)
		res <- res[as.vector(dists == min(dists))]
		# return first result or return template
		return(res[1])
	} else {
		return(resTemplate)
	}
}


Let's run this function on some of the remaining unmatched names. As this may take a while, we will only loop over the first 200 names. You may run it on the whole dataset, but expect it to take about half an hour. Running on the first 200 names will just take a minute.

In [26]:
timeStart <- Sys.time()
# for (i in seq_len(nrow(plants))) {
for (i in seq_len(200)) {
	# only check unmatched cases
	if (plants$matched[i] == FALSE) {
		# counter to show progress
		print(paste(i, Sys.time()))
		res <- nameMatcher(plants[i])
		if (!is.na(res$`global Id`)) {
			resTable[i, LCVP_ID := res$`global Id`]
			plants[i, matched := TRUE]
		}
	}
}
Sys.time() - timeStart


[1] "1 2024-04-12 11:58:24.400659"
[1] "2 2024-04-12 11:58:28.823094"
[1] "3 2024-04-12 11:58:34.227135"
[1] "7 2024-04-12 11:58:37.908572"
[1] "10 2024-04-12 11:58:38.654882"
[1] "19 2024-04-12 11:58:42.660404"
[1] "38 2024-04-12 11:58:43.418891"
[1] "63 2024-04-12 11:58:44.179911"
[1] "70 2024-04-12 11:58:48.752534"
[1] "87 2024-04-12 11:58:49.515793"
[1] "103 2024-04-12 11:58:50.269285"
[1] "125 2024-04-12 11:58:51.013671"
[1] "128 2024-04-12 11:58:56.107607"
[1] "133 2024-04-12 11:59:00.12651"
[1] "141 2024-04-12 11:59:00.888705"
[1] "146 2024-04-12 11:59:01.645952"
[1] "158 2024-04-12 11:59:02.720926"
[1] "174 2024-04-12 11:59:07.297597"
[1] "187 2024-04-12 11:59:08.048403"
[1] "189 2024-04-12 11:59:08.804925"
[1] "191 2024-04-12 11:59:09.561838"


Time difference of 49.24616 secs

In [27]:
plants[c(1, 10, 19, 128, 133)]


oldName,newName,matched,shortName
<chr>,<chr>,<lgl>,<chr>
,,False,
Abuta_panamensis,Abuta_panamensis,True,Abuta_panamensis
Acacia contriva,Acacia contriva,False,Acacia contriva
Aeranthus muscicola,Aeranthus muscicola,True,Aeranthus muscicola
Aesculus xworlitzensis,Aesculus xworlitzensis,True,Aesculus xworlitzensis


What we see from the times needed per individual run is that whenever the genus is found, matching is relatively fast, taking about a second, but when this is not the case, it is quite slow. This is because agrepl() then works on the whole LCVPUnique dataset and has to compare more than one million pairs of words. You could think about a heuristic to reduce the number of rows checked.
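
One possible heuristic is sketched below. `candidateRows` is a hypothetical helper that assumes the first letter of the queried name is spelled correctly. The length criterion is safe on its own, because two strings whose lengths differ by more than `maxDist` characters cannot be within a Levenshtein distance of `maxDist`.

```r
# hypothetical helper: cheap pre-filter to run before the expensive fuzzy matching
candidateRows <- function(query, refNames, maxDist = 2) {
	# assumption: the first letter of the query is correct
	sameInitial <- substr(refNames, 1, 1) == substr(query, 1, 1)
	# names that are much longer or shorter cannot be within maxDist edits
	similarLength <- abs(nchar(refNames) - nchar(query)) <= maxDist
	sameInitial & similarLength
}

refNames <- c("Quercus robur", "Quercus petraea", "Fagus sylvatica", "Quercus")
keep <- candidateRows("Qercus robur", refNames)
refNames[keep]
```

The returned logical vector could replace the `rep(TRUE, ...)` fallback used when the genus is not found, at the price of missing names whose first letter is misspelled.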

### Parallel processing

We will now focus on speeding up the process of name checking by running it in parallel. Let's check how many cores are available on the system.

In [28]:
parallel::detectCores()


On my machine, I can use at most 16 cores. That means that, at best, I can expect a roughly 16-fold decrease in processing time. Assuming that the matching of all unmatched names of the 5000 row dataset would take 30 minutes when running it sequentially, that means that I can expect the task to be completed in about 2 minutes when running in parallel. However, this comes at a cost: When running processes in parallel, R will copy all the objects in the workspace needed for each parallel process, and in our case, that means copying LCVPUnique 16 times. This will take quite some time, and for few iterations of the loop, initializing the parallel process will take more time than is saved by running in parallel. Anyway, we will first try the first 200 names we already processed before (but note that the matched ones will not be done again, because they are matched).

We also need to make some adjustments to the code. As the parallel processes will use their copies of the data, it would not make sense to let them write to the individual copies. Therefore, if a match is found, the information needs to be returned to the main process. Also, as objects are copied for individual workers, the information needed for the individual processes should be kept minimal. I will also not make use of all available cores to make sure I can do other things on my computer without delay while the process is running.
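
The pattern of exporting only what the workers need and returning results to the main process can be sketched on a toy problem. To stay free of extra dependencies, this sketch uses `parLapply()` from the base `parallel` package instead of `foreach` with `doSNOW`. The data and the helper `closestIndex` are invented for illustration.

```r
library(parallel)

# small vector standing in for the reference list
refNames <- c("Quercus robur", "Fagus sylvatica", "Abies alba")
queries <- c("Quercus robor", "Fagus silvatica", "Pinus")

# worker function: index of the closest reference name, or NA if too distant
closestIndex <- function(q, maxDist = 2) {
	d <- as.vector(adist(q, refNames))
	if (min(d) <= maxDist) which.min(d) else NA_integer_
}

# two workers are enough for a toy example
cl <- makeCluster(2)
# export only the object the workers need
clusterExport(cl, "refNames", envir = environment())
# each worker returns its result; the main process collects them
idx <- unlist(parLapply(cl, queries, closestIndex))
stopCluster(cl)

idx
```

The workers never modify shared data. They only return an index, which the main process can then write into the results table.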

In [29]:
# create the cluster for parallel processing
cl <- makeCluster(parallel::detectCores() - 1)
registerDoSNOW(cl)

# run the name resolution in parallel
timeStart <- Sys.time()
resTemp <- foreach(i = seq_len(200), .combine = c, .packages = c("data.table")) %dopar% {
	# only check unmatched cases
	# return the global Id if it is found and NA if not checked or nothing could be found
	if (plants$matched[i] == FALSE) {
		res <- nameMatcher(plants[i])
		res <- res$`global Id`
	} else {
		res <- NA
	}
	# indicate what to return
	res
}
Sys.time() - timeStart
# stop the cluster
stopCluster(cl)


Time difference of 3.274923 mins

The process took about 3.3 mins for me with 15 cores, compared to about 49 secs for the sequential run, so no gain in terms of time for now. Let's see what we got.

In [30]:
resTemp


As we already ran the process sequentially, new matches were not to be expected on the first 200 entries. Let's risk running the process on the whole dataset.

In [31]:
# create the cluster for parallel processing
cl <- makeCluster(parallel::detectCores() - 1)
registerDoSNOW(cl)

# run the name resolution in parallel
timeStart <- Sys.time()
resTemp <- foreach(i = seq_len(nrow(plants)), .combine = c, .packages = c("data.table")) %dopar% {
	# only check unmatched cases
	# return the global Id if it is found and NA if not checked or nothing could be found
	if (plants$matched[i] == FALSE) {
		res <- nameMatcher(plants[i])
		res <- res$`global Id`
	} else {
		res <- NA
	}
	# indicate what to return
	res
}
Sys.time() - timeStart

# stop the cluster
stopCluster(cl)


Time difference of 7.922978 mins

This took about 8 mins. That's a big improvement compared to the sequential process, which we expected to take about half an hour. Let's see how many matches we got.

In [32]:
table(!is.na(resTemp))



FALSE  TRUE 
 4749   251 

So out of the 581 names, another 251 could be matched. Let's transfer the data into resTable. It might well be worth thinking about adding information on the type of matching, as the fuzzy matches are not perfect and might require further checking. However, our current implementation does not give us any information on the type of match.
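
One way to keep track of the type of match is an additional column that is filled at each matching step. The following is a minimal sketch on invented data. The column name `matchType` and the labels "exact", "fuzzy", and "none" are assumptions for illustration.

```r
library(data.table)

# toy results table after exact matching; all values are invented
resToy <- data.table(
	name = c("Abies alba", "Abies albaa", "Xus sp"),
	LCVP_ID = c(11, NA, NA),
	matchType = NA_character_
)
# IDs present at this point stem from exact matching
resToy[!is.na(LCVP_ID), matchType := "exact"]

# hypothetical result of a fuzzy matching step, one entry per row
fuzzyIDs <- c(NA, 11, NA)
# rows that were unmatched so far and received a fuzzy match
isFuzzy <- is.na(resToy$LCVP_ID) & !is.na(fuzzyIDs)
resToy[isFuzzy, `:=`(LCVP_ID = fuzzyIDs[isFuzzy], matchType = "fuzzy")]

# everything still without an ID remains unmatched
resToy[is.na(LCVP_ID), matchType := "none"]
resToy
```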

In [33]:
resTable[!is.na(resTemp), LCVP_ID := resTemp[!is.na(resTemp)]]
plants[!is.na(resTable$LCVP_ID), matched := TRUE]


Let's just look at the results and the remaining names. Maybe you can figure out some possible improvements to the code.

>TASKS:
>1) For example, you could think about allowing for partial matches, if the genus is found, but not the species. This could easily be implemented by extracting the first word from the shortName column.
>2) You could also play with the `maxDist` parameter to increase or decrease the Levenshtein distance. 
>3) To improve the speed of the `nameMatcher` function, you would certainly have to filter potential matches, for example by only including names starting with a certain letter (assuming the first letter is correct), or by only including names with a certain number of characters. 
>4) Finally, the code would be more efficient if you only looped over the rows that have not been matched yet.
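
As a possible starting point for tasks 1 and 4, the sketch below works on the unmatched rows only and checks whether at least their first word is a known genus. All data and the names `namesToy`, `generaToy`, `todo`, and `genusMatch` are invented for illustration.

```r
# toy data standing in for the names, their match status, and the genus list
namesToy <- c("Abies alba", "Abies sp", "Acacia contriva", "Zzz unknown")
matched <- c(TRUE, FALSE, FALSE, FALSE)
generaToy <- c("Abies", "Acacia")

# task 4: indices of the rows that still need to be processed
todo <- which(!matched)

# task 1: partial match on the genus, i.e. the first word of the name
firstWord <- sub("\\s.*", "", namesToy[todo])
genusMatch <- rep(NA, length(namesToy))
genusMatch[todo] <- firstWord %in% generaToy

genusMatch
```

Looping over `todo` instead of all rows avoids starting iterations that return immediately, which matters most when the loop runs in parallel.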

In [34]:
table(plants$matched)
plants[matched == FALSE][1:20]



FALSE  TRUE 
  319  4681 

oldName,newName,matched,shortName
<chr>,<chr>,<lgl>,<chr>
,,False,
(lauraceae) pubescente,(lauraceae) pubescente,False,(lauraceae) pubescente
?Betulaceae sp.,?Betulaceae sp.,False,?Betulaceae sp.
Abies sp,Abies sp,False,Abies sp
Acacia contriva,Acacia contriva,False,Acacia contriva
Acacia sp1887,Acacia sp1887,False,Acacia sp1887
Acarospora radicata,Acarospora radicata,False,Acarospora radicata
Adenanthera sp.,Adenanthera sp.,False,Adenanthera sp.
A-Elyhordeum schaackianum,A-Elyhordeum schaackianum,False,A-Elyhordeum schaackianum
Albizia NA,Albizia NA,False,Albizia NA
